NSF PAR Search | NSF Public Access Repository

Load and MLP-Aware Thread Orchestration for Recommendation Systems Inference on CPUs

https://doi.org/10.1145/3676641.3716003

Jain, Rishabh; Chou, Teyuh; Kayiran, Onur; Kalamatianos, John; Loh, Gabriel H; Kandemir, Mahmut T; Das, Chita R (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026
Stash: A Comprehensive Stall-Centric Characterization of Public Cloud VMs for Distributed Deep Learning

https://doi.org/10.1109/ICDCS57875.2023.00023

Sharma, Aakash; Bhasi, Vivek M; Singh, Sonali; Jain, Rishabh; Gunasekaran, Jashwant Raj; Mitra, Subrata; Kandemir, Mahmut Taylan; Kesidis, George; Das, Chita R (July 2023, IEEE)

Deep neural networks (DNNs) are increasingly popular owing to their ability to solve complex problems such as image recognition, autonomous driving, and natural language processing. Their growing complexity, coupled with the use of larger volumes of training data (to achieve acceptable accuracy), has warranted the use of GPUs and other accelerators. Such accelerators are typically expensive, with users having to pay a high upfront cost to acquire them. For infrequent use, users can instead leverage the public cloud to mitigate the high acquisition cost. However, with the wide diversity of hardware instances (particularly GPU instances) available in the public cloud, it becomes challenging for a user to make an appropriate choice from a cost/performance standpoint. In this work, we try to address this problem by (i) introducing a comprehensive distributed deep learning (DDL) profiler, Stash, which determines the various execution stalls that DDL suffers from, and (ii) using Stash to extensively characterize various public cloud GPU instances by running popular DNN models on them. Specifically, Stash estimates two types of communication stalls, namely interconnect and network stalls, that play a dominant role in DDL execution time. Stash is implemented on top of prior work, DS-analyzer, that computes only the CPU and disk stalls. Using our detailed stall characterization, we list the advantages and shortcomings of public cloud GPU instances to help users make informed decisions. Our characterization results indicate that the more expensive GPU instances may not be the most performant for all DNN models and that AWS can sometimes sub-optimally allocate hardware interconnect resources. Specifically, the intra-machine interconnect can introduce communication overheads of up to 90% of DNN training time, and the network-connected instances can suffer from up to 5× slowdown compared to training on a single instance.
Furthermore, (iii) we also model the impact of DNN macroscopic features such as the number of layers and the number of gradients on communication stalls, and finally, (iv) we briefly discuss a cost comparison with existing work.
Full Text Available
Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

https://doi.org/10.1109/MICRO61859.2024.00091

Jain, Rishabh; Bhasi, Vivek M; Jog, Adwait; Sivasubramaniam, Anand; Kandemir, Mahmut T; Das, Chita R (November 2024, IEEE)

Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are increasingly preferred for executing DLRM inference. However, serving newer DLRMs while meeting acceptable latencies remains challenging, making traditional deployments increasingly GPU-hungry and resulting in higher inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading to an embedding-only performance slowdown of up to 3.2×. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence of low occupancy in the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, pushing the performance by up to 53%. Yet, long memory latency stalls continue to exist. To tackle this challenge, we propose specialized plug-and-play-based software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
Full Text Available
Towards SLO-Compliant and Cost-Effective Serverless Computing on Emerging GPU Architectures

https://doi.org/10.1145/3652892.3700760

Bhasi, Vivek M; Sharma, Aakash; Jain, Rishabh; Gunasekaran, Jashwant Raj; Pattnaik, Ashutosh; Kandemir, Mahmut Taylan; Das, Chita (December 2024, ACM)

Full Text Available
Optimizing CPU Performance for Recommendation Systems At-Scale

https://doi.org/10.1145/3579371.3589112

Jain, Rishabh; Cheng, Scott; Kalagi, Vishwas; Sanghavi, Vrushabh; Kaul, Samvit; Arunachalam, Meena; Maeng, Kiwan; Jog, Adwait; Sivasubramaniam, Anand; Kandemir, Mahmut Taylan; et al (June 2023, International Symposium on Computer Architecture 2023)

Deep Learning Recommendation Models (DLRMs) are very popular in personalized recommendation systems and are a major contributor to data-center AI cycles. Due to the high computational and memory bandwidth needs of DLRMs, specifically the embedding stage in DLRM inference, both CPUs and GPUs are used for hosting such workloads. This is primarily because of the heavy irregular memory accesses in the embedding stage of computation, which lead to significant stalls in the CPU pipeline. As the model and parameter sizes keep increasing with newer recommendation models, the computational dominance of the embedding stage also grows, thereby bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of irregular accesses and their impact on caches and observe that off-chip memory access is the main contributor to high latency. Therefore, we exploit two well-known techniques: (1) software prefetching, to hide the memory access latency suffered by the demand loads, and (2) overlapping computation and memory accesses via hyperthreading, to reduce CPU stalls and minimize the overall execution time. We evaluate our work on single-core and 24-core configurations with the latest recommendation models and recently released production traces. Our integrated techniques speed up the inference by up to 1.59×, and on average by 1.4×.
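As a rough illustration of the software-prefetching idea mentioned in this abstract, the sketch below sum-pools embedding rows while issuing a prefetch hint for a row that will be needed a few lookups later. It is not the authors' implementation: the function name `pooled_lookup`, the `prefetch_distance` parameter, and the row-major table layout are all assumptions made for this example, and `__builtin_prefetch` is a GCC/Clang builtin that is skipped on other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum-pool the embedding rows selected by `indices` into one output vector.
// `table` is assumed to be a row-major [num_rows x dim] matrix.
// `prefetch_distance` is a hypothetical tuning knob: how many lookups ahead
// of the current one a prefetch hint is issued.
std::vector<float> pooled_lookup(const std::vector<float>& table,
                                 std::size_t dim,
                                 const std::vector<std::int64_t>& indices,
                                 std::size_t prefetch_distance = 4) {
    assert(dim > 0 && table.size() % dim == 0);
    const std::size_t num_rows = table.size() / dim;
    std::vector<float> out(dim, 0.0f);

    for (std::size_t i = 0; i < indices.size(); ++i) {
        // Hint the hardware to start fetching a future row while the
        // current row is being accumulated. Only the first cache line of
        // that row is requested here; wide rows would need more hints.
        if (i + prefetch_distance < indices.size()) {
            const std::size_t ahead =
                static_cast<std::size_t>(indices[i + prefetch_distance]);
            assert(ahead < num_rows);
#if defined(__GNUC__) || defined(__clang__)
            __builtin_prefetch(&table[ahead * dim], /*rw=*/0, /*locality=*/3);
#endif
        }

        // Demand load and accumulate the current row.
        const std::size_t row = static_cast<std::size_t>(indices[i]);
        assert(row < num_rows);
        const float* src = &table[row * dim];
        for (std::size_t d = 0; d < dim; ++d) {
            out[d] += src[d];
        }
    }
    return out;
}
```

The prefetch is only a hint and does not change the result, so the function returns the same pooled vector with or without it. Whether it helps, and which distance works best, depends on the table size, row width, and hardware.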
Stash: A comprehensive stall-centric characterization of public cloud VMs for distributed deep learning

Sharma, Aakash; Bhasi, Vivek; Singh, Sonali; Jain, Rishabh; Raj, Jashwant; Mitra, Subrata; Kandemir, Mahmut Taylan; Kesidis, George; Das, Chita (January 2023, Proceedings of the International Conference on Distributed Computing Systems)

Full Text Available
Controlling Nucleation Pathways in Zeolite Crystallization: Seeding Conceptual Methodologies for Advanced Materials Design

https://doi.org/10.1021/jacs.1c11014

Jain, Rishabh; Mallette, Adam J.; Rimer, Jeffrey D. (December 2021, Journal of the American Chemical Society)

Full Text Available
Nanostructuring versus microstructuring in battery electrodes

https://doi.org/10.1038/s41578-022-00454-9

Jain, Rishabh; Lakhnot, Aniruddha Singh; Bhimani, Kevin; Sharma, Shyam; Mahajani, Varad; Panchal, Reena A.; Kamble, Mithil; Han, Fudong; Wang, Chunsheng; Koratkar, Nikhil (September 2022, Nature Reviews Materials)

Full Text Available
Bandgap Tuning in BaZrS₃ Perovskite Thin Films

https://doi.org/10.1021/acsaelm.1c00575

Sharma, Shyam; Ward, Zachary; Bhimani, Kevin; Li, Kang; Lakhnot, Aniruddha; Jain, Rishabh; Shi, Su-Fei; Terrones, Humberto; Koratkar, Nikhil (August 2021, ACS Applied Electronic Materials)

Full Text Available
In situ healing of dendrites in a potassium metal battery

https://doi.org/10.1073/pnas.1915470117

Hundekar, Prateek; Basu, Swastik; Fan, Xiulin; Li, Lu; Yoshimura, Anthony; Gupta, Tushar; Sarbada, Varun; Lakhnot, Aniruddha; Jain, Rishabh; Narayanan, Shankar; et al (March 2020, Proceedings of the National Academy of Sciences)

Significance: Historically, battery self-heating has been viewed negatively as an undesirable attribute. However, we report that battery self-heat, if properly controlled, can smoothen dendritic features in potassium metal batteries. This could open the door to high gravimetric and volumetric energy density potassium-ion batteries that could offer a sustainable and low-cost alternative to the incumbent lithium-ion technology.